Modelling the demographic history of human North African genomes points to a recent soft split divergence between populations

Background: North African human populations present a complex demographic scenario due to the presence of an autochthonous genetic component and population substructure, plus extensive gene flow from the Middle East, Europe, and sub-Saharan Africa.

Results: We conducted a comprehensive analysis of 364 genomes to construct detailed demographic models for the North African region, encompassing its two primary ethnic groups, the Arab and Amazigh populations. This was achieved through an Approximate Bayesian Computation with Deep Learning (ABC-DL) framework and a novel algorithm called Genetic Programming for Population Genetics (GP4PG). This approach enabled us to model intricate demographic scenarios using a subset of 16 whole genomes at >30X coverage. The demographic model suggested by GP4PG fit the observed data more closely than the ABC-DL model. Both models point to a back-to-Africa origin of North African individuals and a close relationship with Eurasian populations. The results support different origins for Amazigh and Arab populations, with Amazigh populations originating in Epipaleolithic times, while GP4PG supports Arabization as the main source of Middle Eastern ancestry. The GP4PG model includes population substructure in the surrounding populations (sub-Saharan Africa and the Middle East), with continuous, decaying gene flow after each population split. In contrast to ABC-DL, the best GP4PG model does not require pulses of admixture from surrounding populations into North Africa, pointing to soft splits as drivers of divergence in North Africa.

Conclusions: We have built a demographic model of North Africa that points to a back-to-Africa expansion and a differential origin of Arab and Amazigh populations.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-024-03341-4.

Parameter                  Prior                    Models (seven columns)
NeSan                      U(10000.0, 90000.0)      X X X X X X X
NeYoruba (WAf)             U(10000.0, 90000.0)      X X X X X X X
NeLuhya (EAf)              U(10000.0, 60000.0)      X X X X X X X
NeTunisia_Chenini (NAfb)   U(1000.0, 20000.0)       X X X X X X X
NeTunisia (NAfa)           U(1000.0, 40000.0)       X X X X X X X
NeQatar (ME)               U(1000.0, 40000.0)       X X X X X X X
NeCEU (EU)                 U(1000.0, 40000.0)       X X X X X X X
NeHan (EAs)                U(1000.0, 40000.0)       X X X X X X X
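To illustrate how uniform priors like these feed a simulation-based analysis, the following is a minimal sketch that draws one set of effective population sizes from the bounds listed in the table. The dictionary layout, the function name `sample_ne`, and the fixed seed are assumptions made for illustration; they are not taken from the ABC-DL or GP4PG code.

```python
import random

# Uniform prior bounds (lower, upper) for effective population sizes,
# copied from the prior table. The dictionary layout is illustrative.
NE_PRIORS = {
    "NeSan": (10000.0, 90000.0),
    "NeYoruba": (10000.0, 90000.0),
    "NeLuhya": (10000.0, 60000.0),
    "NeTunisia_Chenini": (1000.0, 20000.0),
    "NeTunisia": (1000.0, 40000.0),
    "NeQatar": (1000.0, 40000.0),
    "NeCEU": (1000.0, 40000.0),
    "NeHan": (1000.0, 40000.0),
}


def sample_ne(rng):
    """Draw one value per parameter from its uniform prior."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in NE_PRIORS.items()}


rng = random.Random(42)  # fixed seed, only so the sketch is reproducible
draw = sample_ne(rng)
for name, value in draw.items():
    print(f"{name}: {value:.0f}")
```

Each call to `sample_ne` would correspond to the parameter set of one simulation; repeating it many times covers the prior space.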

Fig. S5: Replication PCA for Model D_4 in the ABC-DL analysis. PCA of 1000 simulations of model D_4 (the best model in the ABC-DL analysis) and the replication dataset of observed data. The observed data are an outlier in the PCA, indicating that the ABC-DL model cannot properly replicate the diversity observed in the dataset.

Fig. S6: Box plot of the distances between each simulation in the PCA and the centroid of the PCA. The red dot represents the observed data, which are an outlier in the distribution of distances.
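The distance-to-centroid check described in this caption can be sketched as follows. This is a minimal illustration, assuming the PCA coordinates are already available, that the centroid is taken over the simulated points only, and that "outlier" means lying beyond the usual box-plot upper whisker (Q3 + 1.5 × IQR). The coordinates are hypothetical, and none of the function names come from the original analysis.

```python
import math
import statistics


def centroid(points):
    """Mean coordinate along each principal component."""
    dims = len(points[0])
    return [statistics.fmean(p[d] for p in points) for d in range(dims)]


def upper_whisker_limit(values):
    """Box-plot upper fence: Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 + 1.5 * (q3 - q1)


# Hypothetical PC1/PC2 coordinates of simulations and of the observed data.
sim_pcs = [(0.1, -0.2), (-0.3, 0.1), (0.2, 0.3),
           (-0.1, -0.1), (0.0, 0.2), (0.1, -0.3)]
obs_pc = (2.5, 1.8)

center = centroid(sim_pcs)
sim_dist = [math.dist(p, center) for p in sim_pcs]
obs_dist = math.dist(obs_pc, center)

# The observed point is flagged when it lies beyond the upper fence
# of the simulated distances.
is_outlier = obs_dist > upper_whisker_limit(sim_dist)
print(f"observed distance = {obs_dist:.2f}, outlier = {is_outlier}")
```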

Fig. S8: Coordinates of the different ecodemes tested in the GP4PG analysis. All ecodemes have exactly the same size, and major geographical barriers such as seas and deserts have been removed for the sake of simplicity.

Fig. S9: Fitness of the different runs of the genetic algorithm. a. Distribution of the fitness error of 40 independent iterations of the GP4PG algorithm with 6 competing topologies (B to G in ABC-DL) over 200 generations. Model D is the most frequently selected model, chosen in a fourth of all iterations, with D_15 as the model with the least error. b. PCA plot comparing the jSFS obtained from simulations of the best ABC-DL model with those of the 10 best GP4PG models. The GP4PG simulations explain the observed data better than the ABC-DL simulations. c. Same PCA plot as b, excluding the simulations from the ABC-DL result. Models C_29 and C_39 produce the jSFS most similar to that of the observed data.

Fig. S10: Observed heterozygosity per individual compared by superpopulation. Sub-Saharan populations show higher heterozygosity than Eurasian populations. North African individuals have heterozygosity levels between those of sub-Saharan and Eurasian individuals, probably due to gene flow from sub-Saharan populations into North African individuals.
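As a rough illustration of the statistic plotted here, per-individual observed heterozygosity can be taken as the fraction of called genotypes that are heterozygous. The sketch below assumes diploid genotypes encoded as allele pairs, with `None` for missing calls; this encoding and the function name are hypothetical, and the source does not specify how the value was actually computed.

```python
def heterozygosity(genotypes):
    """Fraction of called diploid genotypes that are heterozygous.

    genotypes: list of (allele1, allele2) tuples, or None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    het = sum(1 for a, b in called if a != b)
    return het / len(called)


# Hypothetical individual: 2 heterozygous sites out of 5 called sites.
sample = [(0, 1), (0, 0), (1, 1), (1, 0), None, (0, 0)]
h = heterozygosity(sample)
print(h)
```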
and prior distributions of the seven considered models in Fig. S4

Table S1: Confusion matrix computed with the 7 models under evaluation. Fifty randomly sampled simulations per model were used as "observed" data for the ABC-DL algorithm. The diagonal, in bold, shows the probability of a model being correctly assigned by the A

Table S2: Proportion of accepted simulations using the postpr function of the "abc" package with tolerance = 0.0008. Model D accounts for 92.2% of the 1000 simulations closest to the observed data.
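The idea behind these proportions can be sketched with a plain rejection step: keep the fraction `tol` of simulations whose summary statistics lie closest to the observed ones, then count how often each model appears among them. The code below is a simplified Python illustration and not the R `abc` package itself. It assumes an unscaled Euclidean distance, and the function name, model labels, and statistics are hypothetical.

```python
import math
from collections import Counter


def model_proportions(sim_models, sim_stats, observed, tol):
    """Share of each model among the closest `tol` fraction of simulations."""
    n_keep = max(1, round(tol * len(sim_stats)))
    # Rank simulations by distance between simulated and observed statistics.
    order = sorted(range(len(sim_stats)),
                   key=lambda i: math.dist(sim_stats[i], observed))
    kept = Counter(sim_models[i] for i in order[:n_keep])
    return {model: count / n_keep for model, count in kept.items()}


# Hypothetical toy data: 10 simulations from three models, one statistic each.
sim_models = ["D", "D", "F", "D", "F", "B", "B", "D", "F", "B"]
sim_stats = [(0.1,), (0.2,), (0.3,), (0.15,), (0.9,),
             (1.5,), (2.0,), (0.8,), (1.1,), (3.0,)]
observed = (0.0,)

props = model_proportions(sim_models, sim_stats, observed, tol=0.4)
print(props)
```

With a tolerance of 0.4 on 10 simulations, the 4 closest simulations are retained; in a real analysis the tolerance is far smaller and the simulation count far larger.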

Table S3: Bayes factor for the ABC-DL topology discrimination analysis. Model D is 11.8 times better at explaining the observed data than the second-best model (Model F).
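For reference, a Bayes factor between two models is the ratio of their posterior odds to their prior odds; with equal prior probabilities it reduces to the ratio of posterior probabilities. The sketch below shows that arithmetic with hypothetical posterior values, not the values behind this table, and the function name is illustrative.

```python
def bayes_factor(post_a, post_b, prior_a=0.5, prior_b=0.5):
    """Posterior odds of model A over model B, divided by their prior odds."""
    if post_b == 0 or prior_b == 0:
        raise ZeroDivisionError("model B needs non-zero posterior and prior")
    return (post_a / post_b) / (prior_a / prior_b)


# Hypothetical posteriors: A holds 75% of accepted simulations, B holds 25%.
bf = bayes_factor(0.75, 0.25)
print(bf)
```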

Table S4: Confusion matrix computed with the five D models under evaluation. Fifty randomly sampled simulations per model were used as "observed" data for the ABC-DL algorithm. The diagonal, in bold, shows the probability of a model being correctly assigned by the ABC.

Table S5: Proportion of accepted simulations using the postpr function of the "abc" package with tolerance = 0.001. Model D4 accounts for 76.22% of the 1000 simulations closest to the observed data.

Table S6: Bayes factor for the ABC-DL analysis with different admixture patterns. Model D4 is 8.074 times better at explaining the observed data than the second-best model (Model D3).